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ABSTRACT 

We develop a software kit called "2MASS Catalog Server Kit" to easily construct a high- 
performance database server for the 2MASS Point Source Catalog (includes 470,992,970 objects) 
and several all-sky catalogs. Users can perform fast radial search and rectangular search using 
provided stored functions in SQL similar to SDSS SkyServer. Our software kit utilizes open- 
source RDBMS, and therefore any astronomers and developers can install our kit on their personal 
computers for research, observation, etc. 

Out kit is tuned for optimal coordinate search performance. We implement an effective radial 
search using an orthogonal coordinate system, which does not need any techniques that depend 
on HTM or HEALpix. Applying the xyz coordinate system to the database index, we can easily 
implement a system of fast radial search for relatively small (less than several million rows) 
catalogs. To enable high-speed search of huge catalogs on RDBMS, we apply three additional 
techniques: table partitioning, composite expression index, and optimization in stored functions. 
As a result, we obtain satisfactory performance of radial search for the 2MASS catalog. Our 
system can also perform fast rectangular search. It is implemented using techniques similar to 
those applied for radial search. 

Our way of implementation enables a compact system and will give important hints for a 
low-cost development of other huge catalog databases. 



Subject headings: Data Analysis and Techniques 



1. Introduction 

Using huge object catalogs is more common in 
astronomical studies and observations. To support 
searching such catalogs, a variety of Web-based 
database services have been developed. For exam- 
ple, NASA/IPAC Infr ared Scien ce Archive (IRSA 
Berriman et al. l2000f) . VizieR (lOchsenbein et al 



2000) and Virtual Observatory (VO: ISzalayll200lh 



portal sites are widely used in astronomical com- 
munities. Some project teams distribute software 
to be installed in personal computers to use such 
services via a network. Astronomers and devel- 
opers can use various catalogs with Web browsers 
and such client-side software. 

On the other hand, there is software to search 
catalogs in offline environments. For examp le, 
"scat" in the WCSTools package (jMinkl 120061 ) is 



widely used in astronomical communities. Such 
software is important for observatories with unsta- 
ble or narrowband networks or for personal stud- 
ies that re quire huge catalog e ntries. Although 
SkyServer (IThakar et all l2004h of Sloan Digital 
Sky Survey fSDSS: lYork et al.ll2000h shows power- 
ful flexibility of the programming interfaces based 
on Structured Query Language (SQL) for catalog- 
search, it is not easy for end users to have such a 
high-performance search server in offline environ- 
ments. 

Therefore, we develop a software kit that en- 
ables any users to construct a database system 
based on a relational data base management sys- 
tem (RDBMS) in their personal computers and to 
quickly search a huge catalog with functions simi- 
lar to SDSS SkyServer. The first target of our kit 



1 



is the Two Micron All Sky Survey (|Skrutskie et al 
200fih Point Source Catalog (2MASS PSC), which 
is huge but frequently used in astronomy. We 
implement powerful functions into the kit using 
our various techniques. One of the features of our 
techniques is applying an xyz coordinate system 
for fast radial search of a huge catalog. This im- 
plementation might be rare and interesting for de- 
velopers, and we mainly report it in this article. 

Our software kit is built on publicly available 
software. In contrast, commercial RDBMS prod- 
ucts have been used to develop previous Web- 
based database se rvices of huge catal ogs such as 



SDSS SkyScrvcr (Thakar ct al 



Science Archive (Ha mblv et al 



.2004) . WFCAM 
2007), etc. This 



article will also demonstrate the true power of an 
open-source RDBMS. 

This article is organized as follows: First, our 
software kit is introduced in $2] We show our 
overview of software design in fJ3] In fJH we ex- 
plain details of our techniques and reasonable im- 
plementation for high-speed radial search of our 
software kit. We additionally report techniques 
for rectangular search in $5l Summary is given in 
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Fig. 1 — Web page of the 2MASS Catalog Server 
Kit. Users can easily install, use, and tune their 
own catalog database. This page includes enough 
contents for RDBMS beginners. 



Note that primitive data types of RDBMS arc 
displayed in the following way throughout this ar- 
ticle: INT2 is a signed two-byte integer, INT4 is a 
signed four-byte integer, and FL0AT8 is a double- 
precision floating-point number. 

2. 2MASS Catalog Server Kit 

The 2MASS Catalog Server Ki& (2MASS 
Kit) is a software package to construct a high- 
performance search server of 2MASS PSC and 
several all-sky catalogs on Linux, MacOSX, So- 
laris, and other UNIX systems. To install this 
kit, it is enough to prepare a standard personal 
computer with a hard drive of 600 Gbyte or more. 

The kit contains a complete data set of tables 
for 2MASS PSC, SQL statements and sources of C 
language with a Makefile. The HTML document 
(Fig. [l| shows step-by-step instructions for instal- 
lation, a tutorial for database beginners, reference 
of stored functions, and several hints for tuning. 



J ' http://www.ir.isas.jaxa.jp/~cyamauch/2masskit/. Soft- 
ware package and data set are available in this page. We 
are planning to support USNO-B1.0, PPMXL, GSC-2.3, 
etc. 



The instructions include a procedure for setting 
up RDBMS and configuration of the operating 
system; therefore, any users can easily construct 
their own catalog database servers. Users can 
search catalogs not only with flexible SQL but also 
with several useful stored functions prepared by 
our kit. Using the functions in users' SQL state- 
ments, users can perform fast radial and rectan- 
gular search with very small SQL statements with 
coordinate conversions (e.g., J2000 to Galactic). 
Of course, users do not have to know algorithm 
and indexing about typical searches. We show 
an example SQL statement of a radial search of 
Galactic coordinate (0,0) with 0.2' radius: 

SELECT f J2L(o.ra,o.dec) as 1, 
f J2B(o.ra,o.dec) as b, 
o. j_m,o.h_m,o.k_m, n. distance 
FROM fTwomassGetNearbyObjEq( 

fG2Ra(0,0) ,fG2Dec(0,0) ,0.2) n, 
twomass o 
WHERE n.objid = o.objid; 

where f J2L() and f J2B() convert J2000 to Galac- 
tic, fG2Ra() and fG2Dec() convert Galactic 
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to J2000, and f TwomassGetNearbyObjEqO is a 
stored function that performs fast radial search of 
2MASS PSC with an optimized algorithm. 

The kit is also characterized by its flexible tun- 
ing. Each table and index for 2MASS PSC is reg- 
istered in one of seven table spaces (each resides 
in a separate directory), thus allowing only the 
essential parts to be easily moved onto fast de- 
vices. Given the terrific evolution that has taken 
place with recent solid-state drives (SSDs) in per- 
formance, a very cost-effective way of constructing 
high-performance servers is moving part, or all, of 
the table indices to a fast SSD. 

Before installing the 2MASS Kit, users can 
confirm the performance of our kit using the 
SQL Search Tooff of the A KARI Catalogue 
Arch ive Server (AKARI-CAS; lYamauchi et al 



20111 ). AKARI-CAS is developed to search 



AKARI All-Sky Catalogues based on imaging 
data obtained by th e Far- Infrared Surveyor (FIS; 
iKawada et~aH 120071) and the Infrared Camera 
flRC; lOnaka et al.l l2007h built on AKARI satel- 
lite, and it also supports fast search for 2MASS 
PSC. To perform this search, AKARI-CAS has 
the source codes of 2MASS Kit. Of course, our 
kit utilizes open-source RDBMS, PostgreSQL-8.4, 
and therefore requires no software licensing fees. 

3. Overview of Software Design 

Choosing an appropriate RDBMS product is 
important for our software design, since functions 
to support various users' requirements depend on 
RDBMS products. 

Users will search the 2MASS PSC using var- 
ious criteria. Therefore, the RDBMS product 
should meet recent SQL standards and must have 
enough search performance. In addition, some 
users might not have knowledge about indices of 
tables. To support such users, the kit has to 
provide some functions to minimize SQL state- 
ments for typical searches in astronomy. There- 
fore, the RDBMS product should have high cod- 
ing flexibility of stored functions. We investigated 
some open-source RDBMS products when devel- 
oping AKARI-CAS. We found that PostgreSQL 
perf ectly satisfies the preced ing requirements. See 
also lYamauchi et al.l ([201 lh for an investigation 



about RDBMS products. 

Fast positional search is indispensable for as- 
tronomy, even if the database has huge catalogs. 
Fortunately, PostgreSQL has some special features 
to handle huge tables. For example, PostgreSQL 
supports table inheritance that is useful for ta- 
ble partitioning and has a "constraint-exclusion" 
feature to allow us a seamless access to the parti- 
tioned tables. To obtain the best performance of 
positional search, i.e., radial search and rectangu- 
lar search, we apply such features and write our 
codes in stored functions for more optimization. 

We mainly tune the performance of radial 
search, since it is most important for astronomical 
catalog databases. Our severe test for it is done 
by cross-identification as multiple radial searches. 
It is the main theme of this article. 

One of the advantages of PostgreSQL is having 
many built-in functions usable in SQL statements. 
Together with them, coordinate conversions shown 
in ij2]such as J2000-to-Galactic conversion are sup- 
ported by newly created stored functions that con- 
tain some codes of wcstools-3.80) Our kit pro- 
vides further func tions that are a l so ava ilable in 
AKARI-CAS. See lYamauchi et al.l f|201lh for gen- 
eral technical know-how for creating stored func- 
tions in PostgreSQL. 

4. Technical Design of Radial Search 
4.1. Basic Algorithm 

The radial search is the most typical query 
in the catalog database services. However, this 
search using RDBMS cannot be optimized in a 
straightforward way, because an index of RDBMS 
is useful for the case: 

Smin < f(cA n ,CB„,---,Cx n ) < S max , (1) 

where S m - ln and S mllx are the search criteria and 
c A n , c B n )••• are data of columns A, B, etc.; however, 
an index cannot be created for following case: 



S'min < f(cA n ,CB„, 5*1, S2) < S n 



(2) 



where Si and S% are also the search criteria. The 
radial search corresponds to the case ([2]), i.e., /() 
is a function that takes a pair of positions and 



2 ) http: / / darts. j axa. j p/ir / akari / cas /tools/ search / sql. html. 



3 ) http:/ /tdc-www. harvard.edu/wcstools/. 
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returns an angular distance, and S max is the search 
radius. 

To enable fast radial search applying the 
database index, some special methods based on 
spatial splitting have been devised, such as Hier- 
archi cal Triangular M eslJ^ (HTM; iKunszt et al 



2000h or HEALPb© (lG6rski et al1l2005l ). Their 
methods divide the sky into many areas, assign 
each area the unique ID, and give each object a 
corresponding ID from which the one-dimensional 
index is created. 

We do not use such techniques, but use the 
more simple and cost-effective way. Figure[5]shows 
the concept of our radial search. The most impor- 
tant point of this concept is the use of the xyz 
coordinate for the database index. 

In our databases, the object tables have columns 
of unit vectors (cx, cy, cz) presenting J2000 source 
positions. We create a composite index on (cx, cy, 
cz) and write stored functions to execute the fol- 
lowing procedure: 

1. Catch objects within a cube of the size 
2rx2rx2r using index scan on (cx, cy, cz). 

2. Select objects within the strict search circle 
on the celestial sphere from the result of step 
1. 

The feature of our algorithm is that it requires 
almost no calculation before executing the index 
scan, and the efficiency is quite high for a small 
search radius. In addition, we do not have to 
impl ement special processing for polar singular- 
ity. iTanakal ( 1993 ) pointed out the advantage of 
applying orthogonal coordinate system to avoid 
polar singularity. 

4.2. Implementation for Small Catalogs 

In this section, we present our implemen- 
tation and test results of radial search of the 
AKA RI/IRC Point Source Catalogue (AKARI/IRC 
PSC: llshihara et al-lfioToh . including 870,973 ob- 
jects. It is relatively small compared with the 
SDSS or 2MASS catalog; therefore, we implement 
the system only with the basic method described 

K 



region for radial search 
(radius = r) 




■ cube for index-scan\ 

A -of (cx, cy, cz) ', 



Fig. 2. — Concept of our radial search with search 
radius = r. Catch objects within a cube 2r each 
side using index scan on (cx,cy,cz), then select 
objects within the strict search area (striped pat- 
tern) from them. 



To enable fast radial search, we construct our 
table and index with the following procedure: 

1. Register all rows of all columns of AKARI/IRC 
PSC into a table Akarilrc. The table in- 
cludes a primary key objID of INT4 type, 
J2000 source position (ra, dec) of FL0AT8 
type and unit vector (cx, cy, cz) of FL0AT8 
type converted from (ra, dec). 

2. Create a composite index on (cx, cy, cz) by 
the following SQL statement!^ 

CREATE INDEX akariirc_xyz 

ON Akarilrc(cx,cy,cz) ; 

To perform a radial search, we create SQL 
stored functions. For example, the source code 
of a stored function to obtain an objID of the 
object whose distance from search position is the 
smallest in the search region is given next© 

CREATE FUNCTION 

f AkarilrcGetNearestObj IDEq(FLQAT8,FL0AT8 , FL0AT8) 



4 ) http://skyserver.org/HTM/. 

5 ) http://healpix.jpl.nasa.gov/. 



6 ' In the case of PostgreSQL, we have to do VACUUM 

ANALYZE; after creating indices. 
7 > See Appendix A for each stored function written in this 

definition. 
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Table 1: Results of match-up of AKARI/FIS 
BSC with AKARI/IRC PSC. The hardware is a 
Core2Quad Q9650 (3.0GHz) CPU on GIGABYTE 
GA-EX38-DS4 with 8 Gbyte DDR2-800 memory 
and an Adaptec RAID 2405 with 1 Tbyte SATA2 
HDDx2 (RAID1). 



Condition 


Elapsed time 


Just after OS rebooting 


42.0 s 


Second trial 


19.8 s 



RETURNS INT4 AS 
'SELECT o.objID 
FROM ( 
SELECT objID, 

fDistanceArcMinXYZ(fEq2X($l,$2) ,f Eq2Y($l , $2) , 
fEq2Z($l,$2) ,cx, cy.cz) as distance 

FROM Akarilrc 
WHERE 

(cx BETWEEN fEq2X($l,$2) 

fEq2X($l,$2) 
(cy BETWEEN fEq2Y($l,$2) 

fEq2Y($l,$2) 
(cz BETWEEN fEq2Z($l,$2) 

fEq2Z($l,$2) 



fArcMin2Rad($3) AND 
fArcMin2Rad($3)) AND 
fArcMin2Rad($3) AND 
fArcMin2Rad($3)) AND 
fArcMin2Rad($3) AND 
fArcMin2Rad($3)) 



) o 

WHERE o. distance <= $3 
ORDER BY o. distance 
LIMIT 1' 
IMMUTABLE LANGUAGE ' sql > ; 

where $1, $2, and $3 are the arguments of this 
stored function; ($1, $2) is the center position in 
J2000 of the search region; and $3 is the search 
radius. 

If we cutout from SELECT o.objID to WHERE 
o. distance <= $3 in the preceding source and 
execute it by giving actual values for $1, $2, and 
$3, it performs a radial search. 

To evaluate the performance of our radial 
search, we try a cross-identification as multiple ra- 
dial searches using f AkarilrcGetNearestObj IDEqO . 
We show our test results of matching up all ob- 
jects of AKARI FIS Bright Source CatalogucS 
(AKARI/FIS BSC; including 427,071 objects) 



with all AKARI/IRC PSC objects within 0.25' 
radius in Table [T] Here is the SQL statement for 
this test: 

SELECT count (f AkarilrcGetNearestObj IDEq(ra,dec , . 25) ) 
FROM AkariFis; 

This returns 19,267 matches. 

AKARI catalogs are small enough compared 
with memory capacities of present computers. 
Therefore, users generally have to wait only 30 
s or so, even for cross-identification. We can im- 
plement radial search of the catalogs including 
less than several million objects with our simple 
method by appl y ing th e xyz coordinate. See also 
Yamauchi et al.l ( 2011 ) for more applications for 



AKARI catalogs using our techniques. 

Implementation for 2MASS PSC 



Data Size Limit of Simple Radial Search 
Implementation 



4.3. 

4.3.1 



In i )4.2[ we store all the contents of a catalog 
into a table, create a composite index on (cx, cy, 
cz), and write a simple SQL stored function to 
perform a radial search. However, there is a limit 
of row numbers for this simple implementation de- 
scribed in £14.21 This limit is caused by two fac- 
tors: (1) enlargement of processing and (2) the 
bottleneck of disk I/O access. First, we show the 
behavior of factor 1 here. 

We test the performance of cross-identification 
as multiple radial searches. We use a subset of 
AKARI/IRC PSC and a subset of 2MASS PSC for 
the cross-identification. The table of the subset of 
2MASS PSC includes all columns and additional 
primary key objid of INT4 and unit vector (cx, 
cy, cz) of FL0AT8 calculated at the data registra- 
tion, and a composite index on the unit vector 
is created. To examine the performance depen- 
dency on the number of data entries, we prepare 
five cases; 5099652, 10226706, 20294711, 40688903 
and 82092729, selected by declination range from 
the south pole. Then we test the performance of 
cross-identification for each case. The following 
SQL statement is an example of the test: 



SELECT count ( 

f TwomassGetNearestObj IDEq(o . ra, 
FROM (SELECT * FROM Akarilrc 



o.dec, 0.25)) 



8 ) AKARI/FIS All-Sky Survey Bright Source Cata- 
logue Version 1.0 Release Note (Yamamura et al. 2010), 
http : //www . ir . isas . j axa . jp/AKARI/Observat ion/PSC/Public/RN/AKARI-FIS_BSC_Vl_RN . pdf 
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Fig. 3. — Performance test of simple radial 
search implementation applied for a huge catalog. 
We test cross-identification between a subset of 
AKARI/IRC PSC and subset of 2MASS PSC for 
different row sizes of the 2MASS PSC table; count 
of radial search per second is plotted against the 
number of rows in the 2MASS PSC table. Hard- 
ware is a 2x Opteron2384 CPU on a Supermicro 
dual-processor board with 32 Gbyte memory. 

WHERE dec < -74.600006) o; 

A criterion of 'dec < -74.60006' is to choose 
AKARI/IRC objects within the region corre- 
sponding to the 2MASS PSC subset. 

Figure [3] shows the result of our tests. In the 
case of the smallest data size about 5.1 million 
objects, searches of more than 3000 counts/s are 
performed. However, processing speed is rapidly 
dropped with increasing number of rows. Each 
measurement in Figure |3] was obtained from the 
median of the last three successive runs; i.e., all 
measurements were taken under the condition of 
enough cached data in the main memory. There- 
fore, this result means that it is impossible to ob- 
tain acceptable performance for severe search re- 
quirements with huge catalogs using this simple 
implementation, even if there is no bottleneck of 
disk I/O access. 



4-3.2. Strategy to Break the Data Size Limit 

As shown in §4.3.11 we cannot achieve sufficient 
performance under severe search requirements of 
huge catalogs using simple implementation. More- 
over, we can easily expect that the bottleneck of 
disk I/O becomes a serious problem for actual use. 
To break such limits, we consider to optimize the 
design of table relations, indices and stored func- 
tions. Our strategy for implementation of huge 
catalogs is as follows; 

1. Reduce the height of the nonunique index. 

2. Reduce the file size of the data set for per- 
forming a radial search. 

3. Note that unique indices (e.g., primary key) 
give enough performance for a huge table. 

4. Minimize CPU time for additional process- 
ing. 

5. Carry out experimental tuning. 

4-3.3. Design of Rable Relation 

Considering the strategy in 34.3. 2\ we deter- 
mine the design of table relation as follows: 

1. We apply the table partitioning technique. 

2. We prepare a special table set consisting of 
only necessary columns for each search pur- 
pose. 

3. We store the object positions into inte- 
ger (INT4) columns in this special table 
set. These integer values are converted and 
scaled from the original floating-point values 
of right ascension and declination. 

Figure 2] shows our design of table relation. 
The main table "twomass" has 470,992,970 en- 
tries (without partitioning ) and is basically sup- 
posed to be searched with the primary key. On the 
other hand, table partitioning is applied to the ta- 
ble sets "twomass_xyzi" and "twomass_j2000i" 
to reduce the height of the nonunique index in 
each table. These two table sets are optimized 
for radial search and rectangular search (see §3]), 
respectively. 

9 ' In our tests, partitioning of main table decreases the per- 
formance of join on the primary key. 
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twomass 
objid (INT4) 
ra 
dec 
err_maj 
errjnin 
err_ang 



main table 



twomass_xyzi 

objid (INT4) 
cxi (INT4) 
cyi (INT4) 
czi (INT4) 



child tables 



tables for radial search 



twomassJ2000i 

objid (INT4) 
rai (INT4) 
deci (INT4) 



child tables 



tables for rectangular search 

Fig. 4. — Design of table relation to enable fast 
positional search of 2MASS PSC. Columns (cxi, 
cyi, czi) and (rai, deci) are the unit vec- 
tors, right ascension and declination, respectively. 
Their integer values are converted and scaled from 
floating-point values in columns (ra, dec) in the 
twomass table. 



to preselect objects within a cube. Therefore, we 
can reduce the size of the index on (cxi, cyi, czi) 
using the composite expression index so that the 
index is created in INT2 type. 

We show an actual SQL statement to create one 
of the indices; 

CREATE INDEX twomass_xyzi_aaa0_il6xyz 
ON Twomass_xyzi_aaaO 

(fGetI16UVecI4(cxi, 32400) , 
fGetI16UVecI4(cyi, 32400) , 
fGetI16UVecI4(czi, 32400)) ; 

where Twomass_xyzi_aaa0 is the name of a child 
table and f GetI16UVecI4(ar3, 32400) scales and 
rounds the unit vector of INT4 into that of INT2 
having a range from —32400 to 32400. This gives 
about 9" of spatial resolution in the worst case; 
however, it does not cause any problems for typical 
searches. 

The data size of all indices on table sets 
twomass_xyzi is about 10 Gbyte, which is small 
enough to be stored in a RAM disk and enables 
faster file access. 



Table partitioning is supported via table in- 
heritance in PostgrcSQL. A parent table has col- 
umn definition and empty rows, and child tables 
have the same columns as those of the parent and 
a number of rows. We determine the contents 
of child tables with the range partitioning using 
values of declination. This partitioning is imple- 
mented so that child tables have almost the same 
number of rows. The optimal number of partitions 
is discussed in M4.3.6I 

The values of (cxi, cyi, czi) are converted 
from the original right ascension and declination 
and scaled between — 2xl0 9 and 2xl0 9 (integer) 
so that the spatial resolution is fine enough for 
astronomical object catalogs. When performing a 
radial search, this integer version of the unit vector 
is restored into floating-point values to calculate 
angular distance. 

The data size of the table set is about 20 Gbyte, 
which reduces disk read traffic and disk seek time. 

4-3.4- Index 

We can notice that spatial resolution of the 
composite index on (cxi, cyi, czi) does not have 
to be that of INT4, since the index is only used 



4-3.5. Stored Function 

PostgreSQL has a constraint-exclusion (CE) 
feature to allow us a seamless access to the par- 
titioned tables. If CHECK constraints are included 
in the definitions of child tables, the server parses 
an SQL statement referring a parent and accesses 
only necessary child table (s). We find that CE 
can improve the performance of a general one- 
time search0 However, it is still not enough 
for repeating a radial search many times within 
a small period of time. It is desired that cross- 
identification can be also performed only with an 
SQL statement that runs multiple radial searches. 

To improve the performance further, we cre- 
ate a stored function to access necessary child 
tables and perform a radial search. Although it 
is best to write the code in C from a performance 
point of view, an SQL execution in a C stored 
function is not supported in PostgreSQL. Alterna- 
tively, PostgreSQL offers dynamic SQL execution 
m PL/pgSQlQ; therefore, we create stored func- 
tions in both PL/pgSQL and C. We show a code 

10 ' We apply CE for rectangular search. See §E\ 
n ' PL/pgSQL is a procedural language for the PostgreSQL 
database system. 
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to create a PL/pgSQL function that obtains an 
ob j ID of the object nearest to the given position 
within the search region (this function is used for 
cross-identification) : 

CREATE FUNCTION f TwomassGetNearestObjIDEq( 

argl FL0AT8 , arg2 FL0AT8, arg3 FLDAT8) 
RETURNS INT4 
AS $$ 
DECLARE 

rt INT4; 
BEGIN 

EXECUTE _f TwomassGetSqlForRadialSearchO 

argl, arg2, arg3, 32400, TRUE 
) INTO rt; 
RETURN rt; 
END 

$$ IMMUTABLE LANGUAGE 'plpgsql'; 

where _f TwomassGetSqlForRadialSearchO is a 

stored function written in C that returns an SQL 
statement for radial search or cross-identification 
(i.e., multiple radial searches), and argl, arg2, 
and arg3 are right ascension, declination, and 
search radius, respectively. The fourth and fifth 
arguments of _f TwomassGetSqlForRadialSearchO 
are the scaling parameter of a composite expres- 
sion index on (cxi, cyi, czi) and a switch to select 
either radial search (FALSE) or cross-identification 
(TRUE), respectively. 

The function _f TwomassGetSqlForRadialSearchO 
knows range information of declination for each 
child table of twomass_xyzi and generates an ap- 
propriate SQL statement with the UNION ALlFr 
keyword (if required) to access necessary child 
table(s). We show an example SQL statement 
generated by the function that searches the near- 
est object from (0,0) in J2000 coordinates within 
a 10' radius: 

SELECT o.objid 
FROM 
(SELECT objid, 

f DistanceArcMinXYZI4(cxi , cyi , czi ,1,0,0) 
AS distance 
FROM twomass_xyzi_ace7 
WHERE (fGetI16UVecI4(cxi, 32400) 

BETWEEN 32305 AND 32400) AND 
(fGetI16UVecI4(cyi, 32400) 

! ) UNION ALL is used to merge two results of SELECT phrases. 



BETWEEN -95 AND 95) AND 
(fGetI16UVecI4 (czi, 32400) 
BETWEEN -95 AND 95) 
UNION ALL 
SELECT objid, 

f DistanceArcMinXYZI4 (cxi , cyi , czi ,1,0,0) 
AS distance 
FROM twomass_xyzi_baa0 
WHERE (fGetI16UVecI4(cxi, 32400) 

BETWEEN 32305 AND 32400) AND 
(fGetI16UVecI4 (cyi, 32400) 
BETWEEN -95 AND 95) AND 
(fGetI16UVecI4 (czi, 32400) 
BETWEEN -95 AND 95) 

) o 

WHERE o. distance <= 10 
ORDER BY o. distance 
LIMIT 1 

Here f DistanceArcMinXYZI4() is a stored func- 
tion written in C to obtain angular distance in 
arcminutes between two positions. 

General radial search is performed by using 
the f TwomassGetNearbyObjEqO function. The 
source code of it is almost the same as that 
of f TwomassGetNearestObjIDEqC) . See also the 
source files in our kit for details. 

4-3.6. Number of Partitions 

The number of partitions of twomass_xyzi is 
an important factor for performance. A public 
version of 2MASS PSC is provided as 92 files 
divided by declination|]3 Therefore, we create 
92 x n partitions and test the performance of cross- 
identification as multiple radial searches for n = 2, 
n = 4, n = 8, and n = 16. Figure [5] is the result 
of the test. All measurements were made with 
enough cached data in the main memory. The 
case of n = 8 shows significant improvement com- 
pared with n — 2 or n — 4. However, the n = 16 
case of radius = 0.25' exhibits a slight decrease of 
the performance. We suspect that increasing the 
UNION ALL phrase in n = 16 case causes the slow- 
down and that around n — 8 might be the best 
number of partitions. 

As a result, we apply n = 8 for our system, 
which achieves comparable performance with the 

') Strictly speaking, neighboring two files have some over- 
lapped area in declination. We have to be careful in the 
area, when selecting objects with strict declination ranges. 
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Fig. 5. — Performance of cross-identification be- 
tween all objects of AKARI/IRC PSC and all 
objects of 2MASS PSC for different numbers of 
partitions of 2MASS PSC. The count of radial 
search per second is plotted against the number 
of child tables of 2MASS PSC. Hardware is a 2x 
Opteron2384 CPU on a Supermicro dual-processor 
board with 64 Gbyte memory. 

case of 5 million objects in Figure [3] 

4.4. Performance with a Standard PC 

We test the performance of a radial search of 
2MASS PSC using a standard personal computer. 
Table [2] is the test results, including four cases of 
different search radii, and shows satisfactory per- 
formance. This test is made in the "psql" interac- 
tive terminal with the following SQL statement: 

SELECT count (*) 

FROM f TwomassGetNearbyObjEq(0 , 0, radius); 

after the \timing command. All measurements 
were made with enough cached data in the main 
memory; therefore, the actual search speed may 
be slower than them, due to the bottleneck of disk 
I/O. This bottleneck is generally reduced by per- 
forming many searches for a long span that in- 
creases memory cache efficiency 

In actual searches, we use this stored func- 
tion and natural join between returned result and 



Table 2: The performance of radial search and 
rectangular search of 2MASS PSC. Hardware is a 
Core2Quad Q9650 (3.0GHz) CPU on GIGABYTE 
GA-EX38-DS4 with 8 Gbyte DDR2-800 memory. 



Search criteria 


No. Obj 


Elapsed time 


Radial (r=T) 


2 


0.001 s 


Radial (r=60') 


5198 


0.022 s 


Radial (r=180') 


47632 


0.149 s 


Radial (r=360') 


189784 


0.484 s 


Rect. (2°x2°) 


6644 


0.025 s 


Rect. (10°xl0°) 


167266 


0.386 s 



twomass table: 

SELECT o . ra , o . dec , o . j _m , o . h_m , o . k_m , 

o . j _msigcom , o . h_msigcom , o . k_msigcom , 
n. distance 

FROM fTwomassGetNearby0bjEq(0,0,3) n, twomass o 
WHERE n. obj id = o. obj id; 

where a join on the primary key n. obj id = 
o. obj id works fast enough. See the document 
of our software kit for details. 

5. Technical Design of Rectangular Search 

As shown in Figure HI we implement rectan- 
gular search using an approach similar to that of 
radial search. Rectangular search is used as a one- 
time search in major cases; therefore, we imple- 
ment it so that we can obtain better performance 
with minimum cost. Although we introduced the 
CE feature of PostgreSQL in g3]and gXSl we did 
not use it for radial search. On the other hand, we 
found that CE is suitable for rectangular search, 
and it simplifies our implementation. 

We create 92 partitions (child tables) for rectan- 
gular search following the recommendation (less 
than 100 partitions) of the official document of 
PostgreSQL, and we distribute all rows into child 
tables divided by their declination. Then we cre- 
ate indices on all the child tables. Here is an 
example to create one pair of indices: 

CREATE INDEX twomass_j2000i_aaa_radeci 

ON Twomass_j2000i_aaa (rai.deci); 
CREATE INDEX twomass_j2000i_aaa_decrai 
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ON Twomass_j2000i_aaa (deci ,rai) ; 

where Twomass_j2000i_aaa is the name of a child 
table. After creating a stored function for rect- 
angular search in PL/pgSQL, we can run a fast 
rectangular search like this: 

SELECT o . ra , o . dec , o . j _m , o . h_m , o . k_m , 

o . j_msigcom, o .h_msigcom, o . k_msigcom 
FROM fTwomassGetObjFromRectEq(0,0. 1, 1,1.1) n, 

twomass o 
WHERE n.objid = o.objid; 

Note that we have to write the code of the 
stored function f TwomassGetObjFromRectEqO in 
PL/pgSQL, since a search must be performed as 
dynamic SQL execution. A stored function writ- 
ten in SQL only supports static SQL execution, 
under which CE does not work. 

Table [2] includes the performance of two cases of 
rectangular search. On average, the search speed 
with CE and partitioning is faster by more than 10 
times compared with the searches using an index 
on a single table. 

Our stored function for rectangular search has 
several other minor contrivances. See the docu- 
ment and source files in the 2MASS Kit for details. 

6. Summary 

We develop a software kit to construct a high- 
performance astronomical catalog database sup- 
porting 2MASS PSC and several all-sky catalogs 
on a standard personal computer. The kit has a 
document that includes step-by-step instructions 
for installation and a tutorial for database begin- 
ners, and it utilizes open-source RDBMS. There- 
fore, any users can easily build their own cat- 
alog server without software licensing fees and 
can search the catalogs with various criteria us- 
ing SQL. 

Out kit is tuned for optimal performance of po- 
sitional search, i.e., radial search and rectangular 
search. We use an orthogonal coordinate system 
for database index to implement the radial search 
that is most important in the positional search. 
This xyz-based method needs neither special pro- 
cessing for polar singularity nor spatial splitting 
such as HTM or HEALPix. Therefore, we can 
develop cost-effective astronomical database sys- 
tems. 



The implementation of radial search for rela- 
tively small (less than several million entries) cat- 
alogs can be very simple, and good performance is 
realized. We also show that such simple implemen- 
tation is not enough for the severe search require- 
ments of huge catalogs, and we need additional 
techniques. We examine our revised implemen- 
tation of radial search of the 2MASS PSC using 
techniques of table partitioning, composite expres- 
sion index, stored functions, etc. Our performance 
tests of cross-identification of AKARI/IRC PSC 
(870,973 objects) with 2MASS PSC (470,992,970 
objects) achieve about a 2000 counts/s radial 
search using a dual-processor server. Additional 
tests using a standard personal computer also 
show satisfactory performance of radial search 
with some typical search radii. 

We also present our simple implementation of 
fast rectangular search for which the constraint- 
exclusion (CE) feature of PostgreSQL works effec- 
tively to improve performance. 

Commercial RDBMS products have often been 
used for services for huge catalogs. Our report 
shows that an open-source RDBMS product is also 
a good choice to develop astronomical database 
services. 
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for his careful reading of the text. We thank Jun 
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cal tips of PostgreSQL. We thank the anonymous 
referee for his installation report and useful com- 
ments regarding this article. We thank Japan 
PostgreSQL Users Group for giving us the chance 
to talk about the 2MASS Kit at the PostgreSQL 
Conference 2011 in Japan. We thank Linux, Post- 
greSQL and other UNIX-related communities for 
the development of various useful software. 
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Appendix A: Stored Functions in §4.2 

Here, we explain each stored function in the 
definition of the f AkarilrcGetNearestObjIDEqO 
function in §4.2. All stored functions shown next 
are written in C. 

• fEq2X(ra,dec), f Eq2Y(ra,dec) and fEq2Z 
(ra.dec) convert J2000 (ra,dec) to (cx, cy, 
cz) of the unit vectors, respectively. 

• fDistanceArcMinXYZ(cxl, cyl, czl, cx2, 
cy2 , cz2) returns the angular distance 
(in arcminutes) between two positions, 
(cxl,cyl,czl) and (cx2,cy2,cz2). 

• f ArcMin2Rad (distance) converts the dis- 
tance in arcminutes to radians. 



Table 3: Results of match-up of AKARI/IRC PSC 
with 2MASS PSC using multiple radial searches 
with multiple sessions. Hardware is 2x XEON 
X5650 (2.67GHz six-core 12-thread) CPU on a 
Supermicro dual-processor board and LSI 921 l-4i 
HBA with a Crucial C300 SSD. 



Sessions 


Elapsed Time 


N of Radial Search 


1 


7.72 minutes 


1881 counts/s 


2 


4.55 minutes 


3189 counts/s 


4 


2.50 minutes 


5796 counts/s 


8 


1.50 minutes 


9697 counts/s 


10 


1.35 minutes 


10741 counts/s 


12 


1.28 minutes 


11333 counts/s 


18 


1.22 minutes 


11902 counts/s 


24 


1.22 minutes 


11898 counts/s 



Appendix B: Performance Tuning 

Throughout the tests of cross-identification pre- 
sented in this article, we use PostgreSQL 8.4.5 
with CentOS 5.5 64-bit on x86_64 compatible 
hardware. We have adjusted the following points 
to obtain the highest performance for our hard- 
ware: 

• Dynamic clocking of the CPU and others 
are disabled. We stop cpuspeed using the 
chkconfig command in OS and turn off 
C1E (Enhanced Halt) in BIOS. If they are 
enabled, the performance may decrease by 
more than 10 %. 

• We set the readahead parameter to 1024 us- 
ing the hdparm command. This sometimes 
improves by several percent compared with 
the default value. 

• We set the noatime option of the mount com- 
mand for database storage. 

Appendix C: Cross-Identification using 
SSD and Multicore CPU 

One of the best methods to perform cross- 
identification wit h huge catalogs is t he plane 



identification wit h huge catalogs is the plane 
sweep techniques ( Devereux et al.ll2005l N. Ham- 
bly 2011, private communication). Although 
2MASS Kit supports cross-identification as mul- 
tiple radial searches, such a RDBMS-based ap- 
proach might generally be unsuitable for the best 



performance. However, we found that satisfactory 
performance can be obtained using 2MASS Kit 
with recent inexpensive and high-speed SSDs and 
multicore CPUs. 

We store the PostgreSQL database files of the 
index set and table set for radial search (about 30 
Gbyte) into a Crucial C300 MLC SSD, and we test 
the performance of cross-identification between 
AKARI/IRC PSC and 2MASS PSC using six-core 
(12-thread) CPUs on a dual-processor board. We 
connect multiple sessions to the PostgreSQL server 
and execute the following SQL statements simul- 
taneously: 

SELECT count ( 

fTwomassGetNearestObjlDEqCo.ra, o.dec, 0.25)) 
FROM ( SELECT * FROM Akarilrc 

WHERE objid '/, n = m ORDER BY dec ) o; 

where n is the number of sessions, and a unique 
sequential number beginning with is assigned to 
to for each session. For example, we set n — 4 and 
to = 0, 1, 2 and 3 for cross-identification using four 
sessions (threads). 

The results are shown in Table [3l Less than 
10 sessions show significant scaling factor. Recent 
mainstream CPUs have four cores or more; there- 
fore, RDBMS-based multiple radial searches might 
become a good choice for several situations. 
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